Dunnett’s Test — Multiple Comparisons vs a Control#

Dunnett’s test is designed for experiments where you compare several treatments to a single control and want to control the family-wise error rate (FWER).

What you’ll learn#

  • When Dunnett’s test is the right tool (and when it isn’t)

  • The test statistic and the “multiple comparisons vs control” adjustment

  • How to interpret adjusted p-values and simultaneous confidence intervals

  • A NumPy-only implementation (Monte Carlo approximation) with Plotly visuals

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

# Plotly rendering (CKC convention)
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

# Reproducibility
rng = np.random.default_rng(7)

np.set_printoptions(precision=4, suppress=True)

1) When to use Dunnett’s test#

Use Dunnett’s test when:

  • You have one control group and m treatment groups.

  • You care about treatment vs control comparisons (not every pair).

  • You want to control the probability of any false positive across those m tests (FWER ≤ α).

Typical examples:

  • Dose–response studies (multiple doses vs placebo)

  • A/B/n experiments with a baseline

  • Regression testing (new versions vs current control)

Not the right tool if:

  • You want all pairwise comparisons (use Tukey’s HSD).

  • Variances differ strongly between groups (consider Welch-type approaches).

2) What problem it solves (multiple testing)#

For each treatment group i (compared to control c), the hypotheses are:

  • Two-sided:
    \(H_{0,i}: \mu_i = \mu_c\) vs \(H_{1,i}: \mu_i \ne \mu_c\)

  • One-sided (greater):
    \(H_{0,i}: \mu_i \le \mu_c\) vs \(H_{1,i}: \mu_i > \mu_c\)

If you run m separate tests at level \(\alpha\), the chance of getting at least one false positive grows with m. Under independence (a useful intuition),

\[\mathrm{FWER} = 1 - (1 - \alpha)^m.\]

Dunnett’s test chooses a single critical value so that, under the global null, the most extreme treatment-vs-control statistic exceeds it with probability exactly \(\alpha\).

alpha = 0.05
m = np.arange(1, 21)

fwer_independent = 1 - (1 - alpha) ** m

fig = px.line(
    x=m,
    y=fwer_independent,
    markers=True,
    labels={"x": "Number of comparisons (m)", "y": "FWER (independence intuition)"},
    title="Why multiple t-tests inflate false positives",
)
fig.add_hline(y=alpha, line_dash="dash", annotation_text="Target α", annotation_position="bottom right")
fig.show()

3) Key idea: correlated t-statistics#

All comparisons share the same control mean, so the treatment-vs-control statistics are correlated. Dunnett’s test uses the joint distribution of these correlated statistics to get a critical value \(c_\alpha\) such that:

  • Two-sided version:
    \[\mathbb{P}\left(\max_{i=1..m} |T_i| > c_\alpha\right) = \alpha.\]

This controls FWER and is typically less conservative than Bonferroni, because it accounts for correlation.
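To see the Bonferroni comparison concretely, here is a minimal standalone sketch: it simulates equicorrelated statistics with the balanced-design correlation \(\rho = 1/2\) (using standard normals, i.e. ignoring the finite-df t correction for simplicity) and compares the \(\max|T|\) cutoff to the Bonferroni one. The constants (`m = 4`, `n_sim = 400_000`) are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, rho, alpha, n_sim = 4, 0.5, 0.05, 400_000

# Equicorrelated normals via a shared component:
# T_i = sqrt(rho)*W + sqrt(1-rho)*Z_i  has  Corr(T_i, T_j) = rho for i != j.
w = rng.standard_normal((n_sim, 1))
z = rng.standard_normal((n_sim, m))
t = np.sqrt(rho) * w + np.sqrt(1 - rho) * z

# Dunnett-style critical value: (1 - alpha) quantile of max_i |T_i|
crit_dunnett = float(np.quantile(np.max(np.abs(t), axis=1), 1 - alpha))

# Bonferroni critical value: (1 - alpha/m) quantile of a single |Z|
crit_bonf = float(np.quantile(np.abs(z[:, 0]), 1 - alpha / m))

print(crit_dunnett, crit_bonf)  # Dunnett's cutoff is the smaller of the two
```

Because the joint distribution concentrates the "max" events, the correlated cutoff sits below the Bonferroni one, which is exactly the power gain Dunnett buys.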

4) Model & assumptions#

Dunnett’s test is built on the one-way ANOVA model:

\[y_{gj} = \mu_g + \varepsilon_{gj}, \qquad \varepsilon_{gj} \overset{iid}{\sim} \mathcal{N}(0,\sigma^2)\]

Assumptions (the same spirit as pooled-variance ANOVA):

  • observations are independent

  • residuals are approximately normal (often reasonably robust for moderate n)

  • homoscedasticity: all groups share the same variance \(\sigma^2\)
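A quick, informal way to screen the homoscedasticity assumption is to compare group standard deviations. The helper below (`sd_ratio_check` is a name invented here, and the 2× ratio threshold is only a common rule of thumb, not a formal test):

```python
import numpy as np

def sd_ratio_check(groups, threshold=2.0):
    """Rule-of-thumb screen: flag if the largest group SD exceeds
    `threshold` times the smallest. Not a formal hypothesis test."""
    sds = {g: float(np.std(x, ddof=1)) for g, x in groups.items()}
    ratio = max(sds.values()) / min(sds.values())
    return sds, ratio, ratio > threshold

rng = np.random.default_rng(0)
demo = {
    "Control": rng.normal(0.0, 1.0, 20),
    "Dose 1": rng.normal(0.0, 1.1, 20),
    "Dose 2": rng.normal(0.0, 0.9, 20),
}
sds, ratio, flagged = sd_ratio_check(demo)
print(sds, ratio, flagged)
```

If this flags a large spread imbalance, the Welch-type approaches mentioned earlier are the safer route.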

5) Test statistic (pooled-variance, ANOVA-style)#

Let:

  • control size \(n_c\) and treatment size \(n_i\)

  • sample means \(\bar y_c\) and \(\bar y_i\)

  • pooled within-group mean square \(\mathrm{MSE}\) (from one-way ANOVA)

  • degrees of freedom \(\mathrm{df} = N_{\text{total}} - k\) where \(k\) is the number of groups

Then for each treatment i:

\[T_i = \frac{\bar y_i - \bar y_c}{\sqrt{\mathrm{MSE}\,(1/n_i + 1/n_c)}}.\]

Dunnett’s test differs from running m t-tests not by the statistic, but by the multiple-comparison adjustment used for decisions, p-values, and confidence intervals.
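As a tiny worked check of the formulas above (toy numbers chosen here for illustration), note that \(\mathrm{MSE}\) pools over all groups, not just the pair being compared:

```python
import numpy as np

# Three tiny groups; MSE pools over ALL groups, not just the pair compared.
control = np.array([1.0, 2.0, 3.0, 4.0])
dose1 = np.array([2.0, 3.0, 4.0, 5.0])
dose2 = np.array([1.5, 2.5, 3.5, 4.5])

groups = [control, dose1, dose2]
sse = sum(float(np.sum((g - g.mean()) ** 2)) for g in groups)
df = sum(g.size for g in groups) - len(groups)  # N_total - k
mse = sse / df

# T_1 per the formula: (mean diff) / sqrt(MSE * (1/n_1 + 1/n_c))
t1 = (dose1.mean() - control.mean()) / np.sqrt(mse * (1 / dose1.size + 1 / control.size))
print(df, round(mse, 4), round(t1, 4))
```

The same statistic is computed group-by-group inside `dunnett_t_stats` in section 7.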

6) Correlation structure (why “Dunnett” is special)#

For two different treatments i and j, the differences \(\bar y_i - \bar y_c\) and \(\bar y_j - \bar y_c\) share the same control mean. That induces positive correlation.

Under the model above, the correlation of the standardized mean differences is:

\[\mathrm{Corr}(T_i, T_j) = \frac{1}{\sqrt{(1 + n_c/n_i)(1 + n_c/n_j)}} \quad (i\ne j).\]

Balanced design example: if \(n_c = n_i = n\), then \(\mathrm{Corr}(T_i, T_j) = 1/2\).
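The balanced-design value of \(1/2\) is easy to confirm empirically: simulate many replicate experiments under the null and correlate the two mean differences, which share the same control mean (sizes and replicate count below are arbitrary illustration choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30          # balanced design: n_c = n_1 = n_2 = n
n_rep = 20_000  # replicate experiments under the null
sigma = 1.0

# Group means have standard deviation sigma / sqrt(n).
ybar_c = rng.normal(0.0, sigma / np.sqrt(n), size=n_rep)
ybar_1 = rng.normal(0.0, sigma / np.sqrt(n), size=n_rep)
ybar_2 = rng.normal(0.0, sigma / np.sqrt(n), size=n_rep)

d1 = ybar_1 - ybar_c  # both differences subtract the SAME control mean
d2 = ybar_2 - ybar_c

print(np.corrcoef(d1, d2)[0, 1])  # close to 0.5
```

This is also exercise 2 at the end of the notebook, run through the full `dunnett_corr` machinery.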

7) NumPy-only Dunnett via Monte Carlo (educational, but accurate with enough samples)#

The exact Dunnett p-values/critical values come from a multivariate t distribution. Instead of calling a stats library, we’ll simulate it directly:

  1. Build the correlation matrix \(R\) of the treatment-vs-control statistics.

  2. Sample \(Z \sim \mathcal{N}(0, R)\).

  3. Sample an independent \(U \sim \chi^2_{\mathrm{df}}\).

  4. Form \(T = Z / \sqrt{U/\mathrm{df}}\) (this is multivariate t).

  5. Use the distribution of \(\max_i |T_i|\) to get a critical value and adjusted p-values.

This is the core principle behind Dunnett’s adjustment.

from typing import Dict, Literal, Tuple

Alternative = Literal["two-sided", "greater", "less"]


def pooled_mse(groups: Dict[str, np.ndarray]) -> Tuple[float, int]:
    '''Pooled within-group mean square (ANOVA MSE).

    MSE = SSE / df, where SSE = sum_g sum_j (y_gj - mean_g)^2
    and df = N_total - k_groups.
    '''
    if len(groups) < 2:
        raise ValueError("Need at least two groups (including control).")

    sse = 0.0
    n_total = 0
    for x in groups.values():
        x = np.asarray(x, dtype=float)
        if x.ndim != 1:
            raise ValueError("Each group must be a 1D array.")
        n_total += x.size
        sse += float(np.sum((x - x.mean()) ** 2))

    k = len(groups)
    df = n_total - k
    if df <= 0:
        raise ValueError("Not enough total samples to estimate pooled variance.")

    return sse / df, int(df)


def dunnett_corr(n_control: int, n_treatments: np.ndarray) -> np.ndarray:
    '''Correlation matrix for treatment-vs-control standardized mean differences.'''
    n_treatments = np.asarray(n_treatments, dtype=float)
    if np.any(n_treatments <= 0) or n_control <= 0:
        raise ValueError("Sample sizes must be positive.")

    a = 1.0 + (n_control / n_treatments)
    corr = 1.0 / np.sqrt(np.outer(a, a))
    np.fill_diagonal(corr, 1.0)
    return corr


def dunnett_t_stats(
    groups: Dict[str, np.ndarray],
    control: str,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, int]:
    '''Compute Dunnett t-statistics vs control.

    Returns (treat_names, diffs, t_stats, mse, df)
    '''
    if control not in groups:
        raise KeyError(f"Control group '{control}' not found.")

    mse, df = pooled_mse(groups)

    names = [g for g in groups.keys() if g != control]
    if len(names) == 0:
        raise ValueError("Need at least one treatment group besides control.")

    y_c = np.asarray(groups[control], dtype=float)
    n_c = y_c.size
    mean_c = float(y_c.mean())

    diffs = []
    t_stats = []

    for g in names:
        y = np.asarray(groups[g], dtype=float)
        n_i = y.size
        diff = float(y.mean()) - mean_c
        se = float(np.sqrt(mse * (1.0 / n_i + 1.0 / n_c)))
        diffs.append(diff)
        t_stats.append(diff / se)

    return np.array(names), np.array(diffs, dtype=float), np.array(t_stats, dtype=float), float(mse), int(df)


def simulate_max_stat(
    df: int,
    corr: np.ndarray,
    n_sim: int,
    alternative: Alternative,
    rng: np.random.Generator,
) -> np.ndarray:
    '''Simulate the max statistic used by Dunnett’s adjustment.'''
    corr = np.asarray(corr, dtype=float)
    if corr.ndim != 2 or corr.shape[0] != corr.shape[1]:
        raise ValueError("corr must be a square matrix")

    m = corr.shape[0]
    if m < 1:
        raise ValueError("Need at least one treatment.")

    L = np.linalg.cholesky(corr)
    z = rng.standard_normal((n_sim, m)) @ L.T

    u = rng.chisquare(df, size=n_sim) / df
    t = z / np.sqrt(u)[:, None]

    if alternative == "two-sided":
        return np.max(np.abs(t), axis=1)
    if alternative == "greater":
        return np.max(t, axis=1)
    if alternative == "less":
        return np.max(-t, axis=1)

    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'.")


def dunnett_test(
    groups: Dict[str, np.ndarray],
    control: str,
    alpha: float = 0.05,
    alternative: Alternative = "two-sided",
    n_sim: int = 200_000,
    seed: int = 7,
) -> Dict[str, object]:
    '''Monte Carlo Dunnett test (NumPy-only).

    Returns a dict containing:
    - names: treatment names
    - diffs: mean(treatment) - mean(control)
    - t: t-statistics
    - p_adj: Dunnett-adjusted p-values (single-step)
    - ci_low, ci_high: simultaneous CI for mean differences (two-sided)
    - crit: Dunnett critical value for the max statistic
    - mse, df: pooled variance estimate and df
    - corr: correlation matrix among T_i
    '''
    if not (0 < alpha < 1):
        raise ValueError("alpha must be in (0,1).")

    names, diffs, t_stats, mse, df = dunnett_t_stats(groups, control)

    n_c = int(np.asarray(groups[control]).size)
    n_t = np.array([int(np.asarray(groups[g]).size) for g in names], dtype=int)
    corr = dunnett_corr(n_c, n_t)

    sim_rng = np.random.default_rng(seed)
    max_stat = simulate_max_stat(df=df, corr=corr, n_sim=n_sim, alternative=alternative, rng=sim_rng)

    crit = float(np.quantile(max_stat, 1 - alpha))

    if alternative == "two-sided":
        t_for_p = np.abs(t_stats)
    elif alternative == "greater":
        t_for_p = t_stats
    else:
        t_for_p = -t_stats

    p_adj = np.array([(max_stat >= float(t0)).mean() for t0 in t_for_p], dtype=float)

    # Two-sided simultaneous CI for mean differences
    se = np.sqrt(mse * (1.0 / n_t + 1.0 / n_c))
    ci_low = diffs - crit * se
    ci_high = diffs + crit * se

    reject = p_adj < alpha

    return {
        "names": names,
        "diffs": diffs,
        "t": t_stats,
        "p_adj": p_adj,
        "ci_low": ci_low,
        "ci_high": ci_high,
        "reject": reject,
        "crit": crit,
        "mse": mse,
        "df": df,
        "corr": corr,
        "max_stat_sim": max_stat,
        "alternative": alternative,
        "alpha": alpha,
    }

8) Worked example (with visuals)#

We’ll simulate a small experiment with one control and three treatments. Two treatments have real effects, one does not.

groups = {
    "Control": rng.normal(loc=0.0, scale=1.0, size=22),
    "Dose 1": rng.normal(loc=0.25, scale=1.0, size=20),
    "Dose 2": rng.normal(loc=0.75, scale=1.0, size=18),
    "Dose 3": rng.normal(loc=0.00, scale=1.0, size=24),
}

{g: (x.size, float(x.mean())) for g, x in groups.items()}
{'Control': (22, -0.3820492834449359),
 'Dose 1': (20, -0.11140381300226458),
 'Dose 2': (18, 0.9206856310625974),
 'Dose 3': (24, -0.08548638452023795)}
fig = go.Figure()
for name, x in groups.items():
    fig.add_trace(go.Violin(y=x, name=name, box_visible=True, meanline_visible=True))

fig.update_layout(
    title="Simulated data by group",
    yaxis_title="Outcome",
    violinmode="group",
)
fig.show()
res = dunnett_test(groups, control="Control", alpha=0.05, alternative="two-sided", n_sim=250_000, seed=1)
res["crit"], res["df"], res["mse"]
(2.392486887925362, 80, 0.7378219350831077)
def print_results(res: Dict[str, object]) -> None:
    names = res["names"]
    diffs = res["diffs"]
    t = res["t"]
    p = res["p_adj"]
    lo = res["ci_low"]
    hi = res["ci_high"]
    reject = res["reject"]

    header = f"{'Treatment':<12} {'Diff':>9} {'t':>9} {'p_adj':>10} {'CI_low':>10} {'CI_high':>10} {'Reject':>8}"
    print(header)
    print("-" * len(header))
    for i in range(len(names)):
        print(
            f"{str(names[i]):<12} "
            f"{diffs[i]:9.4f} "
            f"{t[i]:9.4f} "
            f"{p[i]:10.4f} "
            f"{lo[i]:10.4f} "
            f"{hi[i]:10.4f} "
            f"{str(bool(reject[i])):>8}"
        )

print_results(res)
Treatment         Diff         t      p_adj     CI_low    CI_high   Reject
--------------------------------------------------------------------------
Dose 1          0.2706    1.0198     0.6145    -0.3643     0.9056    False
Dose 2          1.3027    4.7720     0.0000     0.6496     1.9559     True
Dose 3          0.2966    1.1697     0.5124    -0.3100     0.9031    False
names = res["names"]
diffs = res["diffs"]
lo = res["ci_low"]
hi = res["ci_high"]
reject = res["reject"]

colors = np.where(reject, "#d62728", "#1f77b4")

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=names,
        y=diffs,
        marker_color=colors,
        error_y=dict(type="data", array=hi - diffs, arrayminus=diffs - lo),
        name="Mean difference",
    )
)
fig.add_hline(y=0, line_dash="dash")
fig.update_layout(
    title="Treatment − Control mean differences with Dunnett simultaneous CI",
    yaxis_title="Mean difference",
)
fig.show()
corr = res["corr"]

fig = px.imshow(
    corr,
    text_auto=".2f",
    color_continuous_scale="Blues",
    title="Correlation among treatment-vs-control t-statistics",
    labels=dict(x="Treatment", y="Treatment", color="corr"),
)
fig.update_xaxes(ticktext=res["names"], tickvals=list(range(len(res["names"]))))
fig.update_yaxes(ticktext=res["names"], tickvals=list(range(len(res["names"]))))
fig.show()
max_stat = res["max_stat_sim"]
crit = res["crit"]
obs = np.abs(res["t"])

fig = go.Figure()
fig.add_trace(go.Histogram(x=max_stat, nbinsx=60, name="Simulated max |T|", opacity=0.75))
fig.add_vline(x=crit, line_dash="dash", line_color="black", annotation_text=f"crit ≈ {crit:.3f}")
for t0 in obs:
    fig.add_vline(x=float(t0), line_color="red", opacity=0.35)

fig.update_layout(
    title="Dunnett null distribution: max |T| (Monte Carlo)",
    xaxis_title="max |T|",
    yaxis_title="count",
)
fig.show()

9) How to interpret Dunnett results#

For each treatment vs control:

  • Adjusted p-value: the probability (under the global null) that any comparison would look at least this extreme.
    If p_adj < α, you can claim that treatment differs from control while controlling FWER.

  • Sign of the difference / t-statistic: direction of the effect (treatment higher vs lower than control).

  • Simultaneous confidence interval: a set of intervals for all treatment–control differences such that all of them are correct at once with probability \(1-\alpha\). If the CI excludes 0, it matches the “reject” decision.

What it does not mean:

  • p_adj = 0.03 is not the probability the null is true.

  • A non-rejection does not prove equality; it means “not enough evidence given the noise and sample sizes.”

10) Quick insight: FWER control (simulation)#

Below we simulate the global null and compare how often we get at least one false positive.

  • Naive: compare each treatment to control at level \(\alpha\).

  • Bonferroni: compare each at level \(\alpha/m\).

  • Dunnett: use the max-|T| critical value.

Everything uses the same df and correlation induced by the shared control.

def simulate_abs_t(df: int, n_sim: int, rng: np.random.Generator) -> np.ndarray:
    z = rng.standard_normal(n_sim)
    u = rng.chisquare(df, size=n_sim) / df
    return np.abs(z / np.sqrt(u))


m = len(res["names"])
df = int(res["df"])

sim_rng = np.random.default_rng(123)
abs_t = simulate_abs_t(df=df, n_sim=400_000, rng=sim_rng)

crit_unadj = float(np.quantile(abs_t, 1 - alpha))
crit_bonf = float(np.quantile(abs_t, 1 - alpha / m))
crit_dun = float(res["crit"])

crit_unadj, crit_bonf, crit_dun
(1.9909334640191727, 2.4425017266737616, 2.392486887925362)
# Simulate multivariate t under the global null
n_exp = 80_000

sim_rng = np.random.default_rng(456)
L = np.linalg.cholesky(res["corr"])
z = sim_rng.standard_normal((n_exp, m)) @ L.T
u = sim_rng.chisquare(df, size=n_exp) / df
T = z / np.sqrt(u)[:, None]

any_unadj = (np.abs(T) > crit_unadj).any(axis=1)
any_bonf = (np.abs(T) > crit_bonf).any(axis=1)
any_dun = (np.abs(T) > crit_dun).any(axis=1)

fwer = {
    "Naive (α per test)": float(any_unadj.mean()),
    "Bonferroni": float(any_bonf.mean()),
    "Dunnett": float(any_dun.mean()),
}

fig = px.bar(
    x=list(fwer.keys()),
    y=list(fwer.values()),
    labels={"x": "Method", "y": "Empirical FWER under global null"},
    title=f"FWER control at α={alpha} (Monte Carlo)",
)
fig.add_hline(y=alpha, line_dash="dash", annotation_text="Target α", annotation_position="bottom right")
fig.update_yaxes(range=[0, max(list(fwer.values()) + [alpha]) * 1.25])
fig.show()

11) Pitfalls & diagnostics#

  • Heteroscedasticity (unequal variances) can distort pooled-variance methods; check residual spread by group.

  • Non-normality can matter for small n; use residual plots and consider robust/nonparametric alternatives.

  • Only vs control: if you later start caring about treatment-vs-treatment, switch to an all-pairs method.

  • Power: Dunnett is usually more powerful than Bonferroni for the same FWER because it leverages correlation.

Practical workflow:

  1. Inspect plots (group distributions, residuals).

  2. If assumptions look reasonable, run Dunnett.

  3. Report mean differences with simultaneous CIs and adjusted p-values.

12) Exercises#

  1. Increase n_sim and see how stable the critical value res["crit"] becomes.

  2. Make the design balanced (same n for all groups) and verify that the correlation off-diagonal is close to 0.5.

  3. Change alternative to "greater" and interpret the results when effects are negative.

References#

  • Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control.

  • Many statistical software packages implement exact Dunnett p-values via multivariate t CDFs; this notebook uses Monte Carlo to keep the mechanics transparent.